casteryh
Contributor

This pull request adds concurrency control for RDMA operations in TorchStore. It introduces an asynchronous executor that runs RDMA requests sequentially and refactors the codebase to route RDMA work through it. It also updates the client and storage interfaces to accept the executor, improves error handling, and clarifies environment variable defaults.

Concurrency and RDMA execution improvements:

  • Added OnceCell and SequentialExecutor classes in torchstore/_async_utils.py to provide async initialization and sequential execution for RDMA tasks, ensuring only one RDMA operation runs at a time.
  • Refactored LocalClient and the storage classes to accept a SequentialExecutor and use it for RDMA operations. The executor is passed through all relevant layers, and concurrency limits are enforced via a semaphore and a method decorator. [1] [2] [3] [4] [5] [6] [7] [8] [9]
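
For readers unfamiliar with the pattern, here is a minimal sketch of how a sequential executor can serialize async work with a single-permit semaphore. It is not the code in torchstore/_async_utils.py: the `submit` method, its signature, and the `fake_rdma_op` helper are assumptions made for illustration only.

```python
import asyncio
from typing import Any, Awaitable, Callable, TypeVar

T = TypeVar("T")


class SequentialExecutor:
    """Illustrative sketch only; the real class in the PR may differ."""

    def __init__(self) -> None:
        # A semaphore with one permit admits a single task at a time.
        self._sem = asyncio.Semaphore(1)

    async def submit(self, fn: Callable[..., Awaitable[T]], *args: Any) -> T:
        # Hypothetical entry point: wait for the permit, then run the task.
        async with self._sem:
            return await fn(*args)


async def demo() -> int:
    executor = SequentialExecutor()
    active = 0
    max_active = 0

    async def fake_rdma_op(i: int) -> int:
        # Stand-in for an RDMA read/write; tracks how many ops overlap.
        nonlocal active, max_active
        active += 1
        max_active = max(max_active, active)
        await asyncio.sleep(0.01)
        active -= 1
        return i

    # Five callers submit concurrently; the executor runs them one by one.
    await asyncio.gather(*(executor.submit(fake_rdma_op, i) for i in range(5)))
    return max_active


if __name__ == "__main__":
    print("max concurrent ops:", asyncio.run(demo()))
```

Without the executor, the same `gather` call would let all five operations overlap.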

API and initialization changes:

  • Updated client cache in torchstore/api.py to use OnceCell[LocalClient] for lazy, single initialization and refactored related functions to support this pattern. [1] [2] [3]
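
The lazy, single-initialization pattern described above can be sketched as follows. This is an assumption-laden illustration, not the `OnceCell` in the PR: the `get_or_init` method name, the factory argument, and the `make_client` helper are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Generic, Optional, TypeVar

T = TypeVar("T")


class OnceCell(Generic[T]):
    """Illustrative sketch only; the real class in the PR may differ."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()
        self._value: Optional[T] = None
        self._initialized = False

    async def get_or_init(self, factory: Callable[[], Awaitable[T]]) -> T:
        # Fast path: skip the lock once the value exists.
        if not self._initialized:
            async with self._lock:
                # Re-check under the lock so only one caller runs the factory.
                if not self._initialized:
                    self._value = await factory()
                    self._initialized = True
        return self._value  # type: ignore[return-value]


async def demo() -> tuple:
    calls = 0

    async def make_client() -> dict:
        # Hypothetical stand-in for constructing a LocalClient.
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return {"name": "client"}

    cell: OnceCell[dict] = OnceCell()
    clients = await asyncio.gather(*(cell.get_or_init(make_client) for _ in range(4)))
    # Returns the number of factory calls and whether every caller
    # received the same object.
    return calls, all(c is clients[0] for c in clients)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The second check inside the lock is what keeps concurrent first callers from each building their own client.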

Transport buffer interface updates:

  • Modified TransportBuffer and RDMATransportBuffer to require an executor for RDMA read/write operations, enforcing sequential execution and updating method signatures. [1] [2] [3] [4]
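
As a rough illustration of what "requiring an executor" in a buffer interface can look like, the sketch below passes an executor into each write call. All names here (`SerialRunner`, `BufferSketch`, `FakeRdmaBuffer`, `write_from`) are hypothetical and do not reflect the actual TransportBuffer or RDMATransportBuffer signatures.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Awaitable, Callable, List


class SerialRunner:
    """Stand-in for the PR's SequentialExecutor (interface assumed)."""

    def __init__(self) -> None:
        self._lock = asyncio.Lock()

    async def submit(self, fn: Callable[[], Awaitable[None]]) -> None:
        async with self._lock:
            await fn()


class BufferSketch(ABC):
    # The executor is a required argument, so callers cannot bypass it.
    @abstractmethod
    async def write_from(self, data: bytes, executor: SerialRunner) -> None: ...


class FakeRdmaBuffer(BufferSketch):
    def __init__(self, log: List[str]) -> None:
        self._log = log

    async def write_from(self, data: bytes, executor: SerialRunner) -> None:
        async def _op() -> None:
            # Simulated transfer; a real buffer would issue the RDMA call here.
            self._log.append(f"start {data!r}")
            await asyncio.sleep(0.01)
            self._log.append(f"end {data!r}")

        await executor.submit(_op)


async def demo() -> List[str]:
    log: List[str] = []
    runner = SerialRunner()
    a, b = FakeRdmaBuffer(log), FakeRdmaBuffer(log)
    # Two buffers share one runner, so their transfers cannot interleave.
    await asyncio.gather(a.write_from(b"x", runner), b.write_from(b"y", runner))
    return log


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Sharing one runner across buffers is the design point: serialization is a property of the executor, not of any individual buffer.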

Configuration and minor fixes:

  • Changed the default for the TORCHSTORE_RDMA_ENABLED environment variable to "1", enabling RDMA by default.
  • Minor typo and docstring corrections in storage volume and buffer classes.
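
The effect of the new default can be sketched as below. Only the variable name and the "1" default come from this PR; the `rdma_enabled` helper and the exact string comparison are assumptions about how the flag might be read.

```python
import os
from typing import Mapping, Optional


def rdma_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    # Hypothetical helper: treats "1" as enabled, anything else as disabled.
    env = os.environ if env is None else env
    # With the new default, an unset variable means RDMA is on.
    return env.get("TORCHSTORE_RDMA_ENABLED", "1") == "1"


if __name__ == "__main__":
    print("RDMA enabled:", rdma_enabled())
```

Under this reading, users who want to opt out would set `TORCHSTORE_RDMA_ENABLED=0` explicitly.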

Together, these changes make RDMA usage in TorchStore safe under concurrent callers, improve initialization and resource management, and lay groundwork for future scalability.

@meta-cla meta-cla bot added the CLA Signed label Oct 11, 2025
@casteryh casteryh assigned kaiyuan-li and unassigned kaiyuan-li Oct 11, 2025
@casteryh casteryh requested a review from kaiyuan-li October 11, 2025 06:00
@casteryh casteryh marked this pull request as ready for review October 11, 2025 06:35
@codecov-commenter

Welcome to Codecov 🎉

Once you merge this PR into your default branch, you're all set! Codecov will compare coverage reports and display results in all future pull requests.

Thanks for integrating Codecov - We've got you covered ☂️

@casteryh casteryh marked this pull request as draft October 11, 2025 19:30